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Abstract 



We present a new parallel code for computing the dynamical evolution of col- 
lisional iV-body systems with up to N ~ 10 7 particles. Our code is based on the 
Henon Monte Carlo method for solving the Fokker-Planck equation, and makes 
assumptions of spherical symmetry and dynamical equilibrium. The principal 
algorithmic developments involve optimizing data structures, and the introduc- 
tion of a parallel random number generation scheme, as well as a parallel sorting 
algorithm, required to find nearest neighbors for interactions and to compute 
the gravitational potential. The new algorithms we introduce along with our 
choice of decomposition scheme minimize communication costs and ensure opti- 
mal distribution of data and workload among the processing units. Our imple- 
mentation uses the Message Passing Interface (MPI) library for communication, 
which makes it portable to many different supercomputing architectures. We 
validate the code by calculating the evolution of clusters with initial Plummer 
distribution functions up to core collapse with the number of stars, N, spanning 
three orders of magnitude, from 10 5 to 10 7 . We find that our results are in good 
agreement with self-similar core-collapse solutions, and the core collapse times 
generally agree with expectations from the literature. Also, we observe good total 
energy conservation, within < 0.04% throughout all simulations. We analyze the 
performance of the code, and demonstrate near-linear scaling of the runtime with 
the number of processors up to 64 processors for N = 10 5 , 128 for iV = 10 6 and 
256 for N = 10 7 . The runtime reaches saturation with the addition of processors 
beyond these limits, which is a characteristic of the parallel sorting algorithm. 
The resulting maximum speedups we achieve are approximately 60 x , 100 x , and 
220 x, respectively. 
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Subject headings: Methods: numerical, Galaxies: star clusters: general, globular 
clusters: general, Stars: kinematics and dynamics 
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1. Introduction 

The dynamical evolution of dense star clusters is a problem of fundamental importance 
in theoretical astrophysics. Important examples of star clusters include globular clusters, 
spherical systems containing typically 10 5 - 10 7 stars within radii of just a few parsec, 
and galactic nuclei, even denser systems with up to 10 9 stars contained in similarly small 
volumes, and often surrounding a supermassive black hole at the center. Studying their 
evolution is critical to many key unsolved problems in astrophysics. It connects directly 
to our understanding of star formation, as massive clusters are thought to be associated 
with major star formation episodes, tracing the star-formation histories of their host 
galaxies. Furthermore, globular clusters can trace the evolution of galaxies over a significant 
cosmological time span, as they are the brightest structures with which one can trace the 
halo potential out to the largest radii, and they are very old, potentially even predating the 
formation of their own host galaxies. Unlike stars and planetary nebulae, globular clusters 
are not simply passive tracers of galaxy kinematics as their internal dynamics are affected 
by the galactic tidal field. Therefore, their internal properties and correlations with their 
host galaxies are likely to contain information on the merger history of galaxies and haloes. 

Dynamical interactions in dense star clusters play a key role in the formation of many of 
the most interesting and exotic astronomical sources, such as bright X-ray and gamma-ray 
sources, radio pulsars, and supernovae. The extreme local stellar densities, which can reach 
> 10 6 pc -3 , give rise to complex dynamical processes: resonant stellar enc ounters, tidal 



20031 ). The primary 



captures, physical collisions, and high-speed ejections (jHeggie fc Hut 
challenge in modeling dense clusters lies in the tight coupling of these processes and their 
scales as they influence and inform one another both locally, e.g., through close encounters 
or collisions on scales of ~ 1 — 100 R , or 10 -8 — 10 -6 pc, and globally on the scale of 
the whole system through long-range, gravitational interactions. Close binary interactions 
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can occur frequently, every ~ 10 6 — 10 9 yr depending on the cluster density, relative to 
the global cluster evolution timescale. Furthermore, in the time between close encounters, 
stellar particles, single and binary, change their physical properties due to their internal 
nuclear evolution and due to mass and angular momentum transfer or losses. All these 
changes affect the rates of close encounters and couple to the overall evolution of the cluster. 

Given these enormous ranges in spatial and temporal scales, simulating dense star 
clusters with a million stars or more is a formidable computational challenge. A thorough 
analysis of the s caling of the computational cost of direct iV-body methods is presented in 



Hut et al. 



(I1988I ). Although direct iV-body methods are free of any approximations in the 



stellar dynamics, their steep oc N 3 sca l ing; has limited simula tions to an initial N ~ 10 5 stars 



( iZonoozi et al. 



2011 



Jalali et al. 



2012 



Hurley fc Shara 



20121 ). However, the number of stars 



in real systems like globular clusters and galactic nuclei can be orders of magnitude larger 
Even for globular cluster size systems where the evolution is feasible to calculate with a 



direc t iV-body code, the total runtime "takes the better half of a year" (IHurley fc Shara 



20121 ) and statistical results have to rely on only a very few models. This is a problem, given 



the significant in 
parameters (e.g., 



lerent stochasticity of these systems, wh i ch affects even basic s tructural 



Heggie fc Giersz 



2009 



Trenti et al. 



2010; 



Hurley fc Shara 



20121 ). In order 



to draw statistically robust conclusions, a much larger number of realizations of massive 
star clusters has to be calculated, in addition to a wider range of initial conditions. It is 
clear that these requirements result in prohibitive runtimes for direct iV-body codes. 

Monte Carlo methods calculate the dynamical evolution of the cluster in the Fokker- 
Planck approximation, which applies when the evolution of the cluster is dominated by 
two-body relaxation, and the relaxation time is much larger than the dynamical time. 
In practice, further assumptions of spherical symmetry and dynamic al equilibrium have 



to be made. The Henon Monte Carlo (MC) technique (jHenon 



19711) which is based on 
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orbit averaging, represents a balanced compromise between realism and speed. The MC 
method allows for a star-by-star realization of the cluster, with its N particles representing 
the N stars in the cluster. Integratio n is done on the relaxation timescale, and the total 



computational cost scales as N log N (IHenon 



19711 ). 



Our work here is based on the Henon-type MC cluster evolution co d e CMC ("Clus ter 



Monte Carlo"), developed over many years by 



f)2003h : 



Fregeau fc Rasiol (120071 ): 



Jos 



Chatterj ee et al 



li et al. 



teiq); 



2000 



2001); 



Umbreit et al 



regeau et al. 



torn . CMC 



includes a detailec 



( Fregeau fc Rasio 



( jChatterjee et al. 



treat ment of strong binary star interactions and physical stellar collisions 



massive black hole (jUmbreit et al. 



2007), as well as an implementation of single and binary star evolution 



2010 ) and the capabi lity of handling the dynamics around a central 



20121 ). 



In addition to CMC, other MC codes have been developed recently that are based on 
the same orbit averaging technique. Apart from differences in how the stellar and binary 
process have been im plemented, these codes mainly differ in how particles are advanced 



in time. The code of 



Freitag fc Benzl ( 120011 ) uses an individual timestep scheme, where 



each particle is advanced on its own, local relaxation timescale, while the code of 
with its newest version described in 



(1998 



Gicrsz 



Giersz et al. 



( 120111 )) uses a block timestep 



scheme, where the cluster is divided into several radial zones, and particles are evolved 
on the average relaxation timescale of the corresponding zone. While they provide better 
adaptability to density contrasts, individual and block timestep schemes are more difficult 
to parallelize efficiently in a distributed fashion. A shar ed timestep schem e, as implemented 



in CMC, offers a greater degree of inherent parallelism ( iJoshi et al. 



20011 ). 



A typical simulation starting with N ~ 10 6 up to average cluster ages of 12 Gyr using 
CMC can be run on a modern desktop computer in a reasonable amount of time (days 
to weeks). However, given the scaling of computational cost, simulations of clusters with 
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N > 10 7 stars, e.g., nuclear star clusters or galactic nuclei, will still take a prohibitive 
amount of time. Scaling up to even larger number of stars becomes possible only through 
parallelization. 

In this paper, we present in detail the latest version of CMC, which is capable of 
simulating collisional systems of up to N ~ 10 7 . In Section [5J we take a look at the 
components of the serial code and summarize both its numerical and computational aspects. 
In Section [3j we describe the flow of the parallel code, elucidating how we designed each 
part to achieve optimal performance on distributed parallel architectures. In addition, we 
describe in the Appendix, an optional CMC feature that accelerates parts of the code using 
a general purpose Graphics Processing Unit (GPU). We show a comparison of results and 
analyze the performance of the code in Section HJ Conclusions and lines of future work are 
discussed in Section El 



2. Code Overview 

2.1. Numerical Methods 

Starting with an initial spherical system of N stars in dynamical equilibrium, we 
begin by assigning to each star, a mass, a position and a velocity (radial and transverse 
components) by sampling from a distributi on function f (E, J), wher e E and J are the 



orbital energy and angular momentum (e.g., 



Binney fc Tremaine 



20081 ) . We assign positions 



to the stars in a monotonically increasing fashion, so the stars are sorted by their radial 
position initially. The system is assumed to be spherically symmetric and hence we ignore 
the direction of the position vector and transverse velocity. Following initialization, the 
serial algorithm goes through the following sequence of steps iteratively over a specified 
number of timesteps. Figure [T] shows the flowchart of our basic algorithm. 
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1. Potential calculation. The stars having been sorted by increasing radial positions in 
the cluster, the potential at a radius r, which lies between two stars at positions r k 
and r k+ i, is given by 

k N 



$(r) = G 



i=i 



mi 



where m 8 is the mass, and r iy the position of star i. It is sufficient to compute and 
store the potential § k = §{r k ) at radial distances r k (k = 1,...,N), i.e., at the 
positions of all stars. This can be done recursively as follows:: 



N+l 







N 



mi 



i=l 



$ fc = -GM k \- J 



Jk r k+l 
M fc _! = M k -m k . 



(2) 
(3) 

(4) 
(5) 



To get the potential $(r) at any other radius, one first finds k such that r k < r < r k+ i 
and then computes: 

l/r k - 1/r 



l/r k - l/r k+1 



fe+i 



(6) 



2. Tzme siep calculation. Different physical processes are resolved on different timescales. 
We use a shared timestep scheme where timesteps for all the physical processes to be 
simulated are calculated and their minimum is chosen as the global timestep for the 



current iteration . The timesteps are calculated using the fol 



Fregeau fc Rasio 



2007 



Goswami et al. 



2011 



owing expressions (see 



Chatterjee et al. 



2010L for more details): 



7T 



rcl 



(VrelY 



tt/2 32 ln( 7 iV)G 2 n ((Mi + M 2 f) ' 



(7) 
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2a=i6v^.(i?> ff (i+|^) 

T h} l = lQ^n b Xl b {a*)o(l + ° {Ma) 
T,; 1 = A^n s X 2 hs (a 2 ) a (l + 



2<x 2 X 66 (a 2 ) 

G (M) (a) 
<x 2 X bs (a 2 ) 



(8) 
(9) 
(10) 



T se = 0.001M ( . (11) 

where T re \, T co u, Tbb, Tb s and T se are the relaxation, collision, binary-binary, binary- 
single and stellar evolution timesteps respectively. Here 8 max is the maximum angle of 
deflection of the two stars undergoing a representative two-body encounter ; v re i their 
relative velocities, and n the local number density of stars; n s and are the number 
densities of single and binary stars, respectively; a is the one-dimensional velocity 
dispersion, and a is the semi-major axis. and Xb s are parameters that determine 
the minimum closeness of an interaction to be considered a strong interaction. M is 
the total mass of the cluster, T prev , the previous timestep, and Am se , the mass loss 
due to stellar evolution. 

The value of T Te \ is calculated for each star and the minimum is taken as the value of 
the global relaxation timestep. We use sliding averages over the neighboring 10 stars 
on each side to calculate the mean quantities shown in < . . . > in Equation [71 The 
other three tim e steps , T co u,Tbb and Tb s are averaged over the central 300 stars as in 



Goswami et al 



(120111 ). These choices provide a good compromise between accuracy 
and computational speed. Once these five timesteps are calculated, the smallest one 
is chosen as the timestep for the current iteration. 

3. Relaxation and strong interactions (dynamics). Depending on the physical system 
type, one of the following three operations is performed on each pair of stars (i) 
Two-body relaxation is applied based on an analytic expression for a representative 
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encounter between two nearest-neighbor stars, (ii) A binary interaction (binary-binary 
or binary-single) is simulated using F ewbody, an efficient computational toolkit for 



evolving small- N dynamical systems ( IFregeau et al 



20041 ). Fewbody performs a direct 



integration of Newton's equations for 3 or 4 bodies using the 8th-order Runge-Kutta 
Prince-Dormand method, (iii) A stellar collision is treated in the simple "sticky 
sphere" approximation, where two bodies a re merged whenever th eir radii touch, and 



all properties are changed correspondingly ( IFregeau &: Rasio 

I 



20071). 



4. Stellar Evolution. We use the SSE ( jHurley et al. 



2000|) and BSE flHurlev et all 120021 ) 



stellar evolution routines, which are based on analytic functional fits to theoretically 
calculated stellar evolu tion tracks, to simulate the evolution of single and binary stars 



( IChatterjee et al 



20101 ) 



5. New orbits computation. Consistent with the changes in the orbital properties of the 
stars following the interactions they undergo, new positions and velocities are assigned 
for orbits according to the new energy and angular momentum. Then, a new position 
is randomly sampled according to the amount of time the star spends near any given 
point along the orbit. Since this step represents a major computational bottleneck, 
we provide some details here. 

We start by finding the pericenter and apocenter distances of the star's new orbit. 
Given a star with specific energy E and angular momentum J moving in the 
gravitational potential $(r), its rosette-like orbit r(t) oscillates between two extreme 



values r min and r max , which are roots of: 



Q(r) = 2E- 2$(r) - J 2 /r 2 







(12) 



Since we only store the potential values at the positions of the stars, this equation 
cannot be analytically solved before determining the interval in which the root lies. 
In other words, we need to determine k such that Q(rk) < < Qfa+i). We use 
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the bisection method to do this. Once the interval is found, $, and thus Q, can be 
computed analytically in that interval, and so can r min and r max . 

The next step is to select a new radial position for the star between r max and r min . 
The probability is equal to the fraction of time spent by the star in dr, i.e.: 

m = %= rrZjr;^. , ; (i->) 



dr 1 


v r 




Ir max dr/ 

° 'mm ' 





T 

where the radial velocity v r = [Qir)} 1 ^ 2 . We use the von Neumann rejection technique 
to sample a position according to this probability. We take a number F which is 
everywhere larger than the probability distribution /(r). Then we draw two random 
numbers X and X' and compute 

'"O '"min ('"max 7"niiri) X , (1^) 

fo = FX' . (15) 

If the point (/o, ?~o) lies below the curve, i.e., if fo < f(r ), we take r = r as the new 
position; otherwise we reject it and draw a new point in the rectangle with a new 
pair of random numbers. We repeat this until a point below the curve is obtained. 
In our code, a slightly modified version of the method is used, since f(r) = l/|f r | 
becomes infinite a t both ends of the interval. A detailed description can be found in 



JoshietaL 



fcoooh . 



196l|) to 



6. Sort stars by radial distance. This step uses the Quicksort algorithm (iHoard 
sort the stars based on their new positions. Sorting the stars is essential to determine 
the nearest neighbors for relaxation and strong interactions, and also for computing 
the gravitational potential. 

7. Diagnostics, energy conservation, and program termination. These involve the 
computation of diagnostic values such as half-mass radius, core radius, number of 
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core stars etc., that control program termination. In addition, corrections are made 
to the kinetic energy of each particle that account for the fact that the new orbit 
has been calculated in the old potential from the previous timestep. This is done by 
adjusting the stellar velocities according to the mechanical work done on the stellar 



orbit by the time varying potential. We mainly follow the procedure in 



Stodolkiewicz 



(119821 ) except that we apply 



;he correction to t 



ratio v r /vf is preserved. See 



re ve locity components such that the 



Fregeau &: Rasiol (120071 ) for more details. Since these 



minor calculations are scattered throughout various places in the code, they are not 
explicitly shown in the flowchart (Figured]). 



2.2. Time Complexity Analysis 

In addition to the flow of the code, Figure [1] also shows the time complexity for each of 
the above steps. The timesteps, T co n, Tbb and T^ s are averaged over a fixed number of central 
stars, whereas T se is a simple factor of the previous timestep, and hence are 0(1) operations. 
The calculation of T re \ for a single star involves averaging over a fixed number of neighboring 
stars and hence has constant time complexity, 0(1). As this is done for all N stars to 
estimate their individual timesteps from which the the minimum is chosen, the timestep 
calculation scales as O(N). The effect of relaxation and strong interactions is calculated 
between pairs of stars that are radially nearest neighbors. Since these calculations involve 
constant time operations for each of the N stars, the time complexity of the perturbation 
step is O(N). Stellar evolution operates on a star-by-star basis performing operations of 
constant time for a given mass spectrum, and hence also scales as O(N). Determination 
of new orbits for each star involves finding the roots of an expression on an unstructured 
one-dimensional grid using the bisection method. The bisection method scales as 0(logN) 
and as this is done for each star, this step has a time complexity of 0(N log N). The radial 
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Fig. 1. — A flowchart of the CMC (Cluster Monte Carlo) code with the following steps. 
(1) Potential Calculation — calculates the spherically symmetric potential. (2) Timestep 
Calculation — computes a shared timestep used by all processes. (3) Relaxation and Strong 
interactions — depending on the physical system type, performs two-body relaxation, strong 
interaction, or physical collision on every pair of stars. (4) Stellar Evolution — evolves each 
star and binary using fitting formulae (5) New Positions and Orbits-samples new positions 
and orbits for stars. (6) Sorting — sorts stars by radial position. 
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sorting of stars using Quicksort has the same time complexity ( lHoardll96ll ) 



3. Parallelization 

Achieving optimal performance of codes on parallel machines require serial algorithms 
to be carefully redesigned, and hence, parallel algorithms are often very different than their 
serial counterparts and require a considerable effort to develop. The key to a successful 
algorithm is (1) good load balance, i.e., the efficient utilization of the available processing 
units, and (2) minimal communication between these units. The communication cost 
depends directly on the choice of domain decomposition, i.e, the way in which work and 
data is partitioned into smaller units for processing in parallel. For example, a good 
domain decomposition scheme for achieving ideal load balance would be the distribution 
of stars (i.e., their data) evenly among the processors, assuming the computational cost 
for processing each star is similar. This will ensure the workload is evenly balanced across 
processors given that they all perform the same number of operations, as in a Single 
Program, Multiple Data (SPMD) programming model. However, how such a decomposition 
would influence the need for communication between processing units is very specific to the 
algorithms used. In essence, a thorough knowledge of the algorithm, and its data access 
patterns is necessary for designing any efficient parallel application. 



3.1. Data Dependencies and Parallel Processing Considerations 

While deciding upon the domain decomposition, we have to take into account any 
dependencies, i.e., the way the data is accessed by various parts of the application, as they 
may directly influence both the load balance and the communication costs between the 
processing units. A good parallel algorithm should distribute the workload in such a way 
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that the right balance is struck between load balance and communication, resulting in 
optimal performance. 

In CMC, the physical data of each star (mass, angular momentum, position etc.) are 
stored in a structure, a grouping of data elements. The data for N stars are stored an array 
of such structures. For a system with p processors and N initial stars, we will first consider 
the scenario where we try to achieve ideal load balance by naively distributing the data of 
N/p stars to each processor. Fo simplicity, we will assume here that N is divisible by p, and 
analyze the data dependencies in the various modules of CMC for this decomposition. 

1. Timestep Calculation: 

(a) For calculating the relaxation time of each star we need the local density, which 
is calculated using the masses of the 10 nearest neighboring stars on either side of 
the radially sorted list of stars. A parallel program would require communication 
between processors to exchange data of the neighboring stars that are at the 
extreme ends of the local array. 

(b) Calculation of the timestep requires the computation of quantities in the 
innermost regions of the cluster, in particular the central density, and the 
half-mass radius. If the particles are distributed across many processors, 
irrespective of the specific data partitioning scheme, identification of the particle 
up to which the enclosing stellar mass equals half the total mass of the cluster 
might require communication of intermediate results between adjacent data 
partitions, and also would introduce an inherent sequentiality in the code. 

2. Relaxation and strong interactions (dynamics): 

For the perturbation calculation, pairs of neighboring stars are chosen. Communication 
might be needed depending on whether the number of stars in a processor is even or 
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odd. 

3. New orbits computation: 

To determine the new orbits of each star we use the bisection method which involves 
random accesses to the gravitational potential profile of the cluster. Since this data 
will be distributed in a parallel algorithm, communication will be needed for data 
accesses that fall outside the local subset. 

4. Sorting: 

Putting the stars in order according to their radial positions naturally needs 
communication irrespective of the decomposition. 

5. Potential Calculation: The potential calculation, as explained in Section [2J is 
inherently sequential and requires communication of intermediate results. 

6. Diagnostics and program termination: 

The diagnostic quantities that are computed on different computational units need 
to be aggregated before the end of the timestep to check for errors or termination 
conditions. 

3.2. Domain Decomposition and Algorithm Design 

Based on the considerations in the previous section, we design the algorithms and 
decompose the data according to the following scheme so as to minimize communication 
costs, and at the same time not degrading the accuracy of the results. 

Since the relaxation timestep calculation procedure introduces additional communica- 
tion cost irrespective of the choice of data partitioning, we modify it in the following way. 
Instead of using sliding averages for each star, we choose radial bins containing a fixed 
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number of stars to calculate the average quantities needed for the timestep calculation. 
We choose a bin size of 20 which gives a good trade off between computational speed 
and accuracy. We tested this new timestep calculation scheme, and we did not find any 
significant differences compared to the previous scheme. In addition, we distribute the 
stars such that the number of stars per processor is a multiple of 20 (for the timestep and 
density estimates) for the first (p — 1) processors and the rest to the last processor. Since 
2 is a multiple of 20, this removes any dependencies due to the interactions part as well. 
Our choice of data partitioning also ensures a good load balance as, in the worst case, the 
difference between the maximum and minimum number of stars among the processors could 
be at most 19. 

The gravitational potential $(r) is accessed in a complex, data dependent manner 
as we use the bisection method to determine new stellar orbits. Hence, we do not 
partition it among the processors, but maintain a copy of it on all nodes. We also do 
the same for the stellar masses and positions, to remove the dependency in the potential 
calculation. This eliminates the communication required by the new orbits and potential 
calculations. However, it introduces the requirement to keep these data arrays synchronized 
at each timestep and hence adds to the communication. We estimate the communication 
required for synchronization to be much less than what would be added by the associated 
dependencies without the duplicated arrays. 

Most modules of the code perform computations in loops over the local subset of stars 
that have been assigned to the processor. Depending on the computation, each processor 
might need to access data from the local and duplicated arrays. While the local arrays can 
be accessed simply using the loop index, any accesses of the duplicated arrays (potential, 
position, or mass) require an index transformation. For instance, let us consider a simple 
energy calculation routine that calculates the energy of each star in the local array over a 
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loop using the equation 



Ei = $ gi + 0.5 (t£ + vl) 



(16) 



where Ei,v r j and v t ^ are the energy, radial and transverse velocities of star % in the 
local array. The potential array having been duplicated across all processors, the potential 
at the position of the ith star is where gi is the global index given by the following 
transformation which directly follows from our data partitioning scheme explained above: 



gi = 



\i + id 




N 


l 




Urn, 


v _ 


\i + id 




N 


l 






V _ 



n m + id n r 



n m + 



N 



for id < 



mod pn m for id > 



N 

rim 

N 

V 1-771 



mod p , 
mod p . 



(17) 



where id the id of the processor that is executing the current piece of code, which 
ranges between to p — 1, n m is the number we would want the number of stars in each 
processor to be a multiple of, which as per our choice, is 20, and the terms between |_- • -J 
are rounded to the lowest integer. 



3.3. Parallel Flow 

The following gives an overview of the parallel workflow: 

1. Initial partitioning of the star data and distribution of duplicated arrays (mass, and 
radial positions) 

2. Calculate potential 

3. Perform interactions, stellar evolution, and new orbits calculation 

4. Sort stars by radial position in parallel 

5. Redistribute/load-balance data according to domain decomposition 
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6. Synchronize duplicated arrays (mass, and radial positions) 

Then the whole procedure is repeated starting from step 2. The first step is to distribute 
the initial data among processors as per our data partitioning scheme mentioned in Section 
13.21 This is done only once per simulation. This also includes the distribution of a copy 
of the duplicated arrays. In Section [2J we saw that the calculation of the potential is 
inherently sequential requiring communication, since it is calculated recursively starting 
from the outermost star and using the intermediate result to compute the potential of 
the inner stars. However, since now every processor has an entire copy of the potential, 
the positions and mass arrays, it can calculate the potential independently. This does 
not give a performance gain since there is no division of workload, however nullifies the 
communication requirement. Performing interactions, stellar evolution and new orbits 



calculation too don't require any communication whatsoever d ue to our choice o 



data 



1970) as 



partitioning and use of duplicated arrays. We use Sample Sort ( IFraser fc McKellarl . 
the parallel sorting algorithm (see Section [33]). With a wise choice of parameters, Sample 
Sort can provide a near equal distribution of particles among processors. However, since 
we require the data to be partitioned in a very specific way, following the sort, we employ 
a redistribution/load-balancing phase to redistribute the sorted data as per our chosen 
domain decomposition. Sorting and redistribution are steps that naturally require the most 
communication. Before the beginning of the next timestep, we synchronize the duplicated 
arrays on all processors which requires message passing. Some non-trivial communication is 
also required at various places in the code to collect and synchronize diagnostic values. 



3.4. Sorting 



The input to a parallel sorting algorithm consists of a set of data elements (properties 
of N stars in our case), each having a key (radial positions in our case) based on which the 
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data need to be sorted. An efficient parallel sorting algorithm should collectively sort data 
owned by individual processors in such a way that their utilization is maximum, and at the 
same time the cost of redistribution of data across processors is kept to a minimum. In order 
to implement a parallel sorting algorithm, there are a wide array of solutions to consider. 
Each of these solutions cater to a parallel application and/or in some cases a particular 
machine architecture/platform. In general, a parallel programmer should consider many 
different design decisions with careful consideration of the application characteristics. The 
proper assessment of application knowledge often can suggest which initial distributions of 
data among processors are likely to occur, allowing the programmer to implement a sorting 
algorithm that works effectively for that application. 

The importance of load balance is also immense, since the application's execution time 
is typically bound by the local execution time of the most overloaded processor. Since our 
choice of domain decomposition requires a fairly good load balance, our sorting algorithm 
should ensure that the final distribution of keys among processors closely agree with our 
decomposition. This is a challenge since during their evolution, dense star clusters have 
a very strong density contrast, and stars are very unevenly distributed radially with a 
substantial number of stars in the high density core, and a the rest in the extremely low 
density halo. A good parallel sorting algorithm for our application should be able to judge 
the distribution of keys so as to deliver almost equal amount of data in each processor at 
the end of the sorting phase. 

Sample Sort is a splitter-based parallel sorting algorithm that performs a load balanced 
splitting of the data by sampling the global key set. This sampling helps judge the initial 
distribution of keys and accordingly perform the sort, hence resulting in a near-equal 
distribution of data among processors. Given N data elements distributed across p 
processors, Sample Sort consists of 5 phases, shown in Figure |2j 
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Fig. 2. — The Sample Sort Algorithm 
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1. Sort Local Data: Each processor has a contiguous block of data in accordance 
with our data partition (close to N/p in number, see Section l3T2"j) . Each processor, in 
parallel, sorts this local block using sequential Quicksort. 

2. Sampling: All p processors, in parallel, uniformly sample s keys to form a 
representative sample of the locally sorted block of data. These set of p samples, of 
size s from each processor, are collected on one designated processor. This aggregated 
array of samples represent the distribution of the entire set of keys. 

3. Splitter Selection: The combined sample key array is sorted, and keys at indices 
s, 2s, 3s, (p — l)s are selected as splitters and are broadcasted to all processors. 

4. Exchange partitions: The positions of the (p — 1) splitter points in the local array 
are determined by each processor using binary search; this splits the local array into 
p partitions. In parallel, each processor retains the zth partition and sends the jth 
partition to the jth processor, i.e. each processor keeps 1 partition and distributes 
(jp — 1) partitions. At the same time it receives 1 partition from every other processor. 
This might not be true always, particularly in cases of a poor choice of sample size s, 
some splitter points might lie outside the local data array and hence some processors 
might not send data to all (p — 1) processors but only a subset of them. However, for 
the current discussion we will assume a good choice of sample size is made. 

5. Merge Partitions: Each processor, in parallel, merges its p partitions into a single 
list and sorts it. 



One chooses a value for the sample size s so as to sample the distribution of keys accurately, 
and hence this parameter varies depending on the distribution o f keys as well t le data size 
N. More comments on the choice of sample size can be found in iLi et al.l (119931 ). 
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Let us now try to derive the time complexity of Sample Sort. We assume a hypercube 
parallel architecture with cut-through routing, a message routing mechanism in which nodes 
immediately forward messages to subsequent ones as they arrivd_|. The local sort (Phase 
1) requires 0(— log(— )) since there are close to N/p keys per processor. The selection of 
s sample keys (Phase 2, part 1) requires O(s) time. Collecting s keys from p processors 
on to one of the processors (Phase 2, part 2) is a single-node gather operation for which 
the time required is O(sp). The time to sort these sp samples is 0((sp) log(sp)), and the 
time to select {p — 1) splitters (Phase 3, part 1) is 0{p). The splitters are sent to all the 
other processors (Phase 3, part 2) using an one-to-all broadcast which takes 0(p\ogp) 
time. To place the splitters in the local array (Phase 4, part 1) using binary search takes 
(9(plog(^)). The all-to-all communication that follows (Phase 4, part 2) costs a worst case 
time of 0(— ) + 0(p\ogp). The final step that merges and sorts the partitions would cost 

log(^)) time. So the time complexity of the entire algorithm becomes 



'N N\ ( 7V N 

O ( — log— + 0((sp)\og(sp)) + plog— ) +<D(N/p) + 0(p\ogp) . (U 
p p ) \ p 



3.5. Data Redistribution 



In theory, with a good choice of sample size, Sampl e Sort guarant ees to distribute the 



19931 ). However, we would 



particles evenly among processors within a factor of two (ILi et al. 
like to partition the data in such a way that each processor has close to N/p elements, and 
at the same time being multiple of 20. Since the final distribution of data among processors 
after Sample Sort is not deterministic, we include an additional communication phase to 



1 This is faster than when the nodes wait until the entire message is received and stored 
before forwarding, also known as store-and-forward routing. 
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ensure the required domain decomposition is maintained. 

We first calculate the global splitter points that would partition the entire data among 
processors as per our required data partitioning scheme. We then perform a parallel prefix 
reduction (MP I _Exscan), so each processor knows the cumulative number of stars that 
ended up in the processors before it. Using this, it can calculate the range of global indices 
corresponding to the list of stars it currently holds. Now, each processor checks if any of 
the global splitter points other than its own map on to its local array, and if they do, it 
marks the corresponding chunk of data to be sent to the respective processor. Then, we 
perform an all-to-all communication, after which the data on each processor is sorted by 
simply rearranging the received chunks. 

Let us consider an example where there are 4 processors and they receive 100, 130, 
140 and 80 stars respectively after the sorting phase. For a total of 450 stars to be divided 
among 4 processors, the expected distribution would be 120, 120, 100 and 110 stars 
respectively as per our scheme. The corresponding global splitter points would be 120, 240, 
and 340. By doing the prefix reduction on the received number of stars, processor 3, for 
instance, knows there are in total 230 stars in processors 1 and 2 put together. Since it 
received 140 stars after sorting, it also calculates that it has stars with indices between 231 
- 370. Now, two global splitter points, i.e., 240 and 340 of processors 2 and 4 lie within this 
index range, and hence the corresponding stars, i.e., 231 - 240 and 341 - 370 need to be 
sent to processors 2 and 4 respectively. These data chunks are exchanged by performing an 
all-to-all communication, followed by a rearrangement if required. 
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3.6. Parallel Random Number Generation 



The accuracy of results of a parallel Monte Carlo code depends in general on both the 
quality of the pseudo-random number generators (PRNGs) used and the approach adopted 
to generate them in parallel. In our case we need to sample events with very low probability, 
such as physical collisions between stars or binary interactions, which makes it necessary for 
the generated random numbers to be distributed very evenly. More specifically, given N r 
random numbers uniformly distributed in the interval (0, 1), it would be preferable to have 
one random number in each sub-interval of size 1/N r . A special class of random number 
generators for which this property holds in even higher dimensions are the maximally 



equidistributed generators and we choose here t he popular and 



linear feedback shift register" (LFSR) PRNG in iL'Ecuveri (119991 ). 



ast "combined Tausworthe 



PRNGs use a set of state variables which are used to calculate random numbers. Every 
time a random number is generated, the values of these states are updated. A PRNG can 
be initialized to an arbitrary starting state using a random seed. When initialized with the 
same seed, a PRNG will always produce the same exact sequence. The maximum length of 
the sequence before it begins to repeat is called the period of the PRNG. The com bined 



Tausworthe LFSR PRNG we are using here has a period of ~ 2 (IL'Ecuyer 



19991 1 . 



While generating random numbers in parallel, special care has to be taken to ensure 
statistical independence of the results calculated on each processor. For instance, to 
implement a parallel version, we could simply allocate separate state variables for the 
PRNGs on each processor and initialize them with a different random seed. However, 
choosing different seeds does not guarantee statistical independence between these streams. 

An efficient way to produce multiple statistically independent streams is to divide a 
single ran dom sequence into subsequences, with their starting states calculated using jump 



functions ( iCollin; 



20081 ). Taking a starting seed and a jump displacement, D, as inputs, 
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jump functions can very quickly generate the Dth state of the random sequence. We use 
this method to generate multiple starting states, one for each processor, by repeatedly 
applying the jump function and saving intermediate results. The jump displacement is 
chosen as the maximum number of random numbers each processor might require for the 
entire simulation while still providing for a sufficiently large number of streams. Based on 
that we choose D = 2 80 . 



3.7. Implementation 



CMC is written in C, with some pa rts in Fortran. We use the Message Passing 



Interface (MPI) library ( iLusk et al. 



19961 ) to handle communication. The MPI standard 
is highly portable to different parallel architectures, and available on practically every 
supercomputing p latform in use today. The most common MPI communication calls (see 



Gropp et al 



19941 for a detailed description) used in our code are: 



1. M PI _Allreduce/ MPI Jleduce 

MPLReduce combines the elements of a distributed array by cumulatively applying 
a user specified operation as to reduce the array to a single value. For instance, 
when the operation is addition then the resulting value is the sum of all elements. 
MPLAllreduce is MPLReduce with the difference that the result is distributed to all 
processors. The call is used in the following parts of the code: 

(a) Diagnostics and program termination: accumulating diagnostic quantities such 
as the half-mass radius, 77,, and the core radius, r c . 

(b) Timestep calculation: to find the minimum timestep of all stars across processors. 

(c) Sorting and data redistribution: Since stars are created and lost throughout the 
simulation, N is not a constant and changes during a timestep. It is calculated 
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during the sorting step by summing up the local number of stars on each 
processor. 

2. M PI _AU gather /MPI_Gather 

In MPLGather each process sends the contents of its local array to the root, or master, 
process, which then concatenates all received arrays into one. In MPLAllgather this 
concatenated array is distributed to all nodes. The calls are used in the following 
parts of the code: 

(a) Synchronization of duplicated arrays, i.e., $(r) and the stellar masses. 

(b) Sorting and data redistribution: to gather samples contributed by all processors 
on to a single node. See Section 13.41 for details. 

3. MPIMltoall 

In MPLAlltoall the send and receive array is divided equally into p sub-arrays, where 
p is the number of processors. The position of each sub-arryay within the send or 
receive array determines to or from which processor the data is sent or received, 
respectively. MPLAlltoall is only used in "Sorting and data redistribution". See 
Section 13.41 for details. 

4. MPPBcast 

In MPLBcast an array is sent from the root, or master, node to all other processes. 
Used in "Sorting and data redistribution" to communicate the new splitter points 
from a single specified processor to all other. 

5. MPI_Scan/MPI_Exscan 

MPLScan essentially carries out a reduction as in MPLAllreduce except that 
processor % receives the result of the reduction over the data of processors to i. In 



28 



MPLExscan the data is reduced over processors to i-1. MPLScan/Exscan is used in 
Sorting and data redistribution. See Section 1331 for details. 

We also use a number of optimizations for the MPI communication calls. Some examples 
include using MPI derived datatypes for data packing and sending, and combining multiple 
parallel reduction calls for the diagnostic quantities by packing all the data into a single 
buffer and performing the reduction together using a single call which is more efficient. 
However, the overlapping of communication calls with computation we did not explore so 
far, but intend to do so in the future. 



All our test calculations were carried out on Hopper, a Cray XE6 supercomputer at 



NERSCEl that has a peak performance of 1.28 Petaflops, 153,216 processor-cores for running 
scientific applications, 212 TB of memory, and 2 Petabytes of online disk storage. 



In the parallel version, since we use several different random sequences within one 
simulation, the way random numbers are assigned to stars is different from the serial 
version. This would bring in a problem of inconsistency in the results between serial and 
parallel runs, leaving us with no simple way to verify the correctness of the results of our 
parallel version. We tackle this problem by changing the serial version such that it uses the 
same mapping of random number streams to stars as followed in the parallel version. This 
emulation allows us to compare the results of the parallel version with that of the serial 

2 http : / /www . nersc . gov/ 



4. 



Results 




4.1. Accuracy and Reproducibility 
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version. However, note that parallel runs with different numbers of processors will result in 
a different mapping of streams to stars, making a separate serial run necessary to compare 
results for each case. 

We ran test simulations for 50 timesteps with the parallel and serial versions, using 
or emulating up to a few processors, respectively. Initially, the clusters had N = 10 5 , 10 6 , 
and 10 stars with positions and velocities distributed according to a Plummer model. By 
comparing the positions and masses of every star in the cluster at the end of the simulations, 
we found that the parallel and corresponding serial results were matching accurately down 
to the last significant digit (all variables are in double precision). We also compared a few 
diagnostic quantities, such as the core radius, and core density, and they were matching as 
well, except for the last four significant digits. This slight loss in accuracy is due to the 
MPI_Reduce calls, which perform cumulative operations (sum, max, min etc.) on data 
distributed among the processors. This introduces different round-off errors since one does 
not have control over the order in which the data aggregation is done. 



4.2. Comparison to Theoretical Predictions 



In order to verify that our code reproduce s well-known theoretical r esults, we calculate 



the evolution of single-mass Plummer spheres (IBinney fc Tremaine 



20081 ) until core collapse 



(without any binaries or stellar evolution). With 10 5 , 10 6 , and 10 7 stars, this is the first time 
that collisional iV-body simulations covering three orders of magnitude up to N = 10 7 stars 
have been carried out. We used 128, 256 and 1024 processors for these runs, respectively, 
which deliver peak performance for the three cases (see Section I4T31) . The wall clock run 
times for these simulations were 11.4 mins, 1.17 hrs and 12.8 hrs, respectively. 



One remarkable property realized early on is that the cluster evolution proceeds 
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Fig. 3. — Evolution of an isolated Plummer model showing the ratio of core radius 
to half- mass radius, r c /r h (top), and the core density, p c (bottom). Time is in ini- 
tial half-mass relaxation times. The various lines represent different particle numbers, 
N = 10 5 (dashed), 10 6 (dotted), 10 7 (solid) . 
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asymptotically in a self-similar fashion, th at is, the c 
scale and normalization at late times (e.g 



Cohn 



1980 



uster density profiles differ only in 



Binney fc Tremaine 



20081 ). This can 



be clearly seen in Figure IU where we plot the density profile of the cluster with N = 10 7 at 
various times during its evolution to core collapse. For each profile we observe a well-defined 

ightly large r 



core and a power law density profile, p oc r _/3 , with (3 ~ 2.3. This is only s. 
than the value (3 = 2.23 usually quoted in the literature (first obtained by ( 1Cohnlll980l )). 
The slight discrepancy, however, arises probably because the density profiles in Figure H] 
were taken at a time when core collapse has not yet proceeded far enough. 

Figure [3] shows the evolution of the core radius, r c (t)/r h (t), as well as the core density, 
p c {t), for the models with N = 10 5 , 10 6 and 10 7 sta rs. We use the notation r c for the 



density-weighted core radius (ICasertano fc Hut 



19851 ). and p c for the core density, defined 



as 



Pc 



T,iP^ 



(19) 



19851 ). 



where pi is the 6th order density estimator around the zth star (ICasertano fc Hut 
The core density rho c is expressed in units of M c r~£., where M c is the total cluster mass 
and r vir is the virial radius, defined as GM^ /(2E ) with E the initial total gravitational 
energy of the cluster, and G the constant of gravity. One can immediately see that all 
three clusters reach core collapse at similar times, with t = t cc ~ 17.4t r h. l6.7t r h and 16 .6 t r h 



respectively, where i r h is the initial half-mass relaxation time defined as (ISpitzer 



19871 ) 



0.138iV 



<^rh 



1/2 



(20) 



ln( 7 iV) \GM 

with 7 in the Coulomb logarithm chosen to be 0.1. Thus, the core collapse times are 
not only in very good agreement with previously published results that all lie between 15 
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and 18 t r h (see, e.g., iFreitag & Benzll200ll . for an overview), but also confirm that our code 
can reproduce the scaling of t cc with t T ^ within « 10% over three orders of magnitude in 
N. The scaling behavior becomes even better, with deviations < 1% , if only the runs with 
N > 10 6 are considered. The larger deviation in i cc between the N = 10 5 run and the other 
two are probably because of the larger stochastic variations in low N runs. 

Another consequence of the self-similarity of the clust er evolution to core collapse is 



that — (3 ~ log(p c (t))/log(r c (i)) (IBinney fc Tremaine 



20081 ). which means that the decrease 



in r c leads to an increase in p c at a rate that is related to the shape of the density profile. 
Indeed, from Figure [3] one can see that the shape of p c (t) mirrors the shape of r c (t) as 
expected. Even more so, from Figure [5j we find that the power-law slope of p c (?"c) becomes 
(3 = —2.24 close to core-collapse (r r /r„ ir < 3 x 10~ 4 ) , which is in excellent agreement with 



the corresponding = —2.23 found by 



Cohnl (1l980l ). It is worth noting that (3 slowly 



changes with r c , increasing from around f3 = —2.5 and eventually converging to (3 = —2.24, 
which is also reflected in the power-law slopes of the density profiles we obtain from our 
simulations for increasing times (Figure H]). 

Apart from the self-similar behavior, we also find that there is very little mass loss 
(< 1%), and hence very little energy is carried away by escaping stars, in agreement with 



theoretical expectations (e.g. 



Lightman fc Shapiro 



19781 ). Finally, we find that our code 



conserves total energy to better than 0.04% throughout the entire simulation. 



4.3. Performance Analysis 

We tested our parallel code for 3 cluster models with N— 10 5 , 10 6 , and 10 7 stars, with 
an initial Plummer density profile. We used 1 to 1024 processors and measured the total 
time taken, and time taken by various parts of the code for up to 50 timesteps. For the 
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Fig. 4. — Evolution of the density profile at various times during core collapse for the 
N = 10 7 run. The dashed line shows the slope of the expected power-law density profile 
(IHeggie & Stevenson! 1 198a bohnlll980f ). 
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Fig. 5. — Core density as function of r c /r vir . The density is expressed in units of 
p V i r = M c r~f T . The dashed line shows the power-law slope expected for a cluster close 
to c ore collapse based on the self- similarity of the cluster evolution in that regime (see, 



e.g. 



Binney fc Tremaine 



20081 ). In our simulation this regime appears to be reached for 



r c /r V ir < 3 x 10 4 , and a power-law fit to the core density in this range results in a slope of 
-2.24. 
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Fig. 6. — Scaling of the wall-clock time with the number of processors, p, for a parallel 
simulation of up to 50 timesteps. The various lines represent different particle numbers (see 
legend). We observe a near-linear scaling for up to 64 processors. 
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sample size s in the Sample Sort algorithm, we chose 128, 256 and 1024 respectively, for the 
three N values. 

The timing results are shown in Figure [6] and a corresponding plot of the speedups in 
Figure [71 These results do not include the time taken for initialization. We can see that the 
speedup is nearly linear up to 64 processors for all three runs, after which there is a gradual 
decrease followed by saturation. For the N = 10 5 and 10 6 case, we also notice a dip after 
the saturation, also expected for the N = 10 7 case for a larger number of processors than 
we consider here. We also see that the number of processors for which the speedup peaks 
is different for each value of N, and gradually increases with N. The peak is seen at 256 
processors for the N = 10 5 run, somewhere between 256 and 512 for the N = 10 6 run, and 
1024 for the N = 10 7 run. The maximum speedups observed are around 60 x, 100 x, and 
220 x for the three cases respectively. 

Figure [8] shows the scaling of the time taken by various modules of our code for the 
N = 10 6 run. One can observe that the dynamics, stellar evolution, and orbit calculation 
modules achieve perfectly linear scaling. The ones that do not scale as well are sorting and 
the "rest", which include diagnostics, potential calculation and timestep calculation. As 
the number of processors increase, the linear scaling of the former three parts of the code 
reduces their time to very small values, in turn letting the parts that do not scale well 
dominate the runtime. This is the reason for the trend observed in the total execution time 
and speedup plots. We can also particularly see that the time taken by sorting starts to 
increase after reaching a minimum, and this explains a similar observation in the previous 
plots as well. 

Figure [9] shows the experimental timing results of our sorting step for the three N 
values plotted against the theoretical values calculated using Equation [181 Since the entire 
star data is communicated during the all-to-all communication phase of the sort and not 
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Fig. 7. — Speedup of our parallel code as a function of the number of processors, p. The 
various lines represent different particle numbers (see legend). The straight line represents 
the ideal speedup, which is the case when the speedup equals p. We observe that for all 
three cases, the speedup closely follows the ideal speedup line for up to 64 processors. 
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Fig. 8. — Time taken by the various parts of the code for a parallel run with N = 10 6 
stars. The various lines represent the different modules of the algorithm (see legend). The 
module denoted by "Rest" is the cumulative time taken by diagnostics, potential calculation 
and timestep calculation steps. One can observe that the dynamics, stellar evolution, and 
new orbits computation modules scale linearly whereas the other two exhibit relatively poor 
scaling beyond 64 processors. 
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Fig. 9. — Time taken by the sorting routine of the parallel code plotted against the theoretical 
time complexity of Sample Sort based on Equation [TS] for various values of N. Appropriate 
proportionality constants were used based on the data size transported during the all-to-all 
communication phase of Sample Sort (see Section |3~4"|) . 
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just the keys, and a data size of a single star is 46 times greater than that of a single key 
in our code, we multiply the O(Nfp) term of Equation HH] with a proportionality constant 
of 46. We see that for all three cases, the expected time linearly decreases, reaching a 
minimum after which it increases. This implies that for every data size, there is a threshold 
for the number of processors that would achieve optimal performance beyond which it 
will worsen. The main reason is that, as the number of processors increases, smaller 
amounts of data are distributed across many processors, which makes the communication 
overhead dominant. In addition, the number of samples to be accumulated and sorted on a 
single node (Phase 2 and 3 of Sample Sort, see Section I3T4"|) increases with the number of 
processors, and hence this phase tends to become a bottleneck as well. From Figure [9j we 
also see that our implementation very closely follows the trend predicted by the theoretical 
complexity formula. In addition, the number of processors at which maximum performance 
is attained match fairly closely as well. 

The other parts of the code that do not scale well include diagnostics, potential 
calculation and timestep calculation. The computation of diagnostics and program 
termination related quantities require a number of collective communication calls to 
aggregate data across processors, and this is difficult to parallelize efficiently. However, 
there might be potential to improve this scaling by interleaving the collective communication 
calls with computation. While the timestep calculation is embarrassingly parallel and can 
be expected to scale well, the time taken for potential calculation would remain constant 
irrespective of the number of processors used, since every processor computes the entire 
potential array, and has little scope for optimization. A thorough profiling of the poorly 
scaling parts is necessary in the future to identify the dominant among these bottlenecks, 
and prioritize optimization steps. 

We noted earlier that the total execution time followed a very similar trend as sorting, 
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however, the places at which the minimum execution time was observed are not the same, 
but shifted to the right. We can now infer this to be a result of the linearly-scaling parts of 
the code pushing these points to the right until the relatively poorly scaling parts dominate 
the runtime. 

5. Conclusions 

We presented a new parallel code, CMC (Cluster Monte Carlo), for simulating 
collisional iV-body systems with up to iV ~ 10 7 . In order to maintain a platform- 
independent implementation, we adopt the Message Passing Interface (MPI) library for 
communication. The parallelization scheme uses a domain decomposition that guarantees 
a near-equal distribution of data among processors to provide a good balance of workload 
among processors, and at the same time minimizes the communication requirements by 
various modules of the code. Our code is based on the Henon Monte Carlo method, with 
algorithmic modifications including a parallel random number generation scheme, and a 
parallel sorting algorithm. We presented the first collisional iV-body simulations of star 
clusters with N covering three orders of magnitude and reaching up to N = 10 7 . The core 
collapse times obtained in our simulations are in good agreement with previous studies, 
providing basic validation of our code. We also tested our implementation on 1 to 1024 
processors. The code scales linearly up to 64 processors for all cases considered, after which 
it saturates, which we find to be characteristic of the parallel sorting algorithm. The overall 
performance of the parallelization is impressive, delivering maximum speedups of up to 
220 x for N = 10 7 . 

Interesting future lines of work may include reducing the communication overhead 
by overlapping communication with computation. In addition to the distributed memory 
parallel version, CMC has also an optional feature that accelerates parts of the algorithm 



-42 - 



using a general purpose Graphics Processing Unit (GPU), described in the Appendix. 
An important next step towards reaching even higher N values is the development of a 
hybrid code which can run on heterogeneous distributed architectures with GPUs. With 
progress along these lines, we may be able to reach the domain of galactic nuclei for the 
first time. Many galactic nuclei contain massive, dense star clusters, so-called nuclear star 
cluste rs, which ar e thought to be significantly linked to the evolution of their host galaxies 



(e.g., 



Boker 



20101 ). and their 



the formation of the galaxy (IMerritt 



propert 



ies m ight still reflect to some extent the details of 



2010|). In addition, with their much larger masses 
and escape velocities, galactic nuclei are likely to retain many more stellar-mass black 
holes than globular clusters, and, thus, might significantly contribute to the black hole 
binary merger rate, as w ell as to the gravitational wave detection rate of advanced LIGO 



( Miller fc Lauburg 



20091 ). Therefore, the study of galactic nuclei with a fully self-consistent 
dynamical code such as CMC has the potential to make strong predictions for future 
gravitational wave detection missions, and might give further insights into the evolution of 



galaxies. 
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A. GPU Acceleration of the Orbit Computation 

An optional feature of the CMC code is the GPU acceleration of the orbit computation 
that consists of finding r min and r max of each stellar orbit and sampling a new orbital position 
(see Section [2]). As we have shown, the time complexity of both parts is 0(N log N), and 
the orbit and new position for each star can be determined independently from the other 
stars. This makes the orbit computation particularly well suited to being calculated on a 
GPU, not only because of the inherent parallelism of the algorithm, but also for the large 
number of memory accesses, which also scale as 0(N log N), and, thus, allow us to take 
advantage of the fast GPU memory. 

Based on the structure of the algorithm, our implementation assigns one thread on the 
GPU to do the computations for one star. This ensures minimal data dependency between 
the threads since the same set of operations are performed on different data, and makes 
the bisection method and rejection technique implementations naturally suited for SIMD 
(Single Instruction, Multiple Data) architectures, such as the GPU. In the following we 
describe the specific adaptations of the serial implementation of the algorithms to the GPU 
architecture and present performance results. 

A.l. Memory Access Optimization 

To harness the high computation power of the GPU, it is very essential to have a good 
understanding of its memory hierarchy in order to develop strategies that reduce memory 
access latency. The first step towards optimizing memory accesses is to ensure that memory 
transfer between the host and the GPU is kept to a minimum. Another important factor 
that needs to be considered is global memory coalescing in the GPU which could cause a 
great difference in performance. When a GPU kernel accesses global memory, all threads in 
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groups of a half- warp access a bank of memory at the same time (INvidial 120101 ) . Coalescing 
of memory accesses happens when data requested by these groups of threads are located in 
contiguous memory addresses, in which case they can be read in one (or very few number 
of) access(es). Hence, whether data is coalesced or not has a significant impact on an 
application's performance as it determines the degree of memory access parallelism. In 
CMC, the physical properties of each star are stored in a C structure, containing 46 double 
precision variables. The N stars are stored in an array of such C structures. 

Figure [TU] gives a schematic representation of the data reorganization. At the top, the 
original data layout is shown, i.e., an array of structures. The kernels we parallelize only 
require 5 among the 46 variables present in the star structure: radial distance, r, mass 
m, energy, E, angular momentum, J, and potential at r, 0, which are shown in different 
color. To achieve coalesced memory accesses, we need to pack the data before transferring 
it to the GPU in a way that they would be stored in contiguous memory locations in the 
GPU global memory. A number of memory accesses involve the same set of properties for 
different stars being accessed together by these kernels since one thread works on the data 
of one star. Hence, we extract and pack these into separate, contiguous arrays, one for each 
property. This ensures that the memory accesses in the GPU will be mostly coalesced. 
Also, by extracting and packing only the 5 properties required by the parallel kernels, we 
minimize the data transfer between the CPU and GPU. 



A. 2. Random Number Generation 

For the generation of random numbers we use the same combined Tausworthe generator 
and parallel implementation scheme as described in Section 13.61 That is, for each thread 
that samples the position of a star, there is one random stream with an initial state that has 
been obtained by jumping multiple times in a seeded random sequence to ensure statistical 
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independence between streams. As we will see later, to achieve optimal performance, 6000 
to 8000 threads, and, thus, streams, are required. This can be easily accomodated by one 
random sequence, as for our jump distance of 2 80 and a random number generator period 
of ~ 2 113 , m 10 10 non-overlapping streams can be generated. Once generated on the host, 
the initial stream states are transferred to the GPU global memory. Each thread reads the 
respective starting state from the memory and produces random numbers independently. 



A. 3. Performance Results 

All our simulations are carried out on a 2.6 GHz AMD PhenomTM Quad-Core Processor 
with 2 GB of RAM per core and an NVIDIA GTX280 GPU, with 30 multiprocessors, 
and 1 GB of RAM. The algorithms have been written in the CUDA C language, and 
were compiled with the version 3.1 of the CUDA compiler. All computations are done in 
double precision, using the only Double Precision Unit (DPU) in each multiprocessor on 
the GTX280 GPU. 

We collect the timing results for 5 simulation timesteps of a single-mass cluster with a 
Plummer density profile, and siz es ranging from 10 6 to 7 x 10 6 stars , encompassing « 25% 



of all globular cluster sizes (e.g., 



McLaughlin &: van der Marel 



20051 ) 



Figure [TT] compares the GPU and CPU runtimes. Figure [12] shows the speedup of 
the new orbits computation part and the bisection and rejection kernels individually. All 
speedup values are with respect to the code performance on a single CPU. We see that the 
average speedups for the rejection and bisection kernels are 22 and 31, respectively. This 
is due to the difference in the number of floating point operations between the two kernels 
which is a factor of 10. This makes a major difference on the CPU but not on the GPU as 
it has more arithmetic logic units (ALUs). Indeed, the bisection and rejection kernels take 
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Fig. 10. — Data coalescing strategy used to strip the original star data structure and pack 
into contiguous arrays before transferring them to the GPU. The original data layout is an 
array of structures (top), which is reorganized into fewer separated arrays (bottom) for the 
variables used in the new orbits computation. 
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Fig. 11. — Comparison of total runtimes of the sequential and parallelized kernels for various 
N. 
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Fig. 12. — Overall and individual speedups of the bisection and rejection kernels. The mean 
overall speedup was observed to be 28 x, with the individual mean speedups being 22 x and 
31 x for the bisection and rejection kernels respectively. 
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about equal amount of time on the GPU for all N. This also indicates that the performance 
of these kernels is only limited by the memory bandwidth as they roughly require the same 
amount of global memory accesses. 

We also observe that the total speedup increases slightly as the data size increases. 

In general, we obtain very good scalability. Analyzing the dependence of the runtime 
on N in Figure [11] we find that the GPU runtimes follow closely the kernel's complexity of 
0(N log N). The runtimes on the CPU, on the other hand, have a steeper scaling with N, 
such that the run with N = 7 x 10 6 takes a factor of 11 longer than with N = 10 6 , instead 
of the expected factor of 8. The reason for the somewhat worse scaling of the runtime on 
the CPU is not yet clear and remains to be investigated in the future. 

Note that as the memory transfer between the CPU and GPU is currently not 
optimized, our speedup calculations do not include that overhead. However, as we transfer 
only a subset of the entire data for each star, there is the potential space for improvement 
to interleave kernel computations with data transfer and substantially reduce this overhead. 

Finally, we looked at the influence of GPU specific parameters on the runtime. In 
order to improve performance and utilization of the 30 multi-processors, we partitioned our 
data space into a one- dimensional grid of blocks on the GPU. Due to the complexity of the 
expressions involved in the calculations of orbital positions, our kernels use a significant 
amount of registers (64 registers per thread). Thus, the block dimension is restricted to 
256 threads per block as the GTX280 GPU has only 16384 registers per block. To analyze 
the performance, we first made a parameter scan in the block and grid dimensions by 
varying the block sizes from 64 to 256 and the grid sizes from 12 to 72. Figure [13] shows 
the total runtime of all kernels as a function of the total number of active threads on the 
GPU. As expected, the runtime decreases with increasing thread number but saturates at 
around 6000 threads. The saturation is most likely due to the finite memory bandwidth, 
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Fig. 13. — Total runtime of all kernels over the total number of threads. The runtime 
decreases with increasing thread number and saturates at around 6000 threads, which is 
expected to be due to the finite memory bandwidth. 
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as we noted before that the runtime of the kernels is mainly determined by the number of 
memory accesses as opposed to the number of floating point operations. One can also see 
that the curve shows little scatter, ps 0.1 s, which means that the specific size of a single 
block of threads has a minor effect on performance. We furthermore find that for a given 
total number of threads, the runtime is shortest when the total number of thread blocks is 
a multiple of the number of multi-processors, in our case 30. 
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